1 -00:00:00,485 --> 00:00:05,515 2 00:00:08,355 --> 00:00:11,357 - Okay, it's after 12, so I think we should get started. 3 00:00:14,644 --> 00:00:17,419 Today we're going to kind of pick up where we left off last time. 4 00:00:17,419 --> 00:00:23,400 Last time we talked about a lot of sort of tips and tricks involved in the nitty gritty details of training neural networks. 5 00:00:23,400 --> 00:00:30,439 Today we'll pick up where we left off, and talk about a lot more of these sort of nitty gritty details about training these things. 6 00:00:30,439 --> 00:00:34,707 As usual, a couple administrative notes before we get into the material. 7 00:00:34,707 --> 00:00:39,645 As you all know, assignment one is already due. Hopefully you all turned it in. 8 00:00:39,645 --> 00:00:57,322 Did it go okay? Was it not okay? Rough sentiment? Mostly okay. Okay, that's good. Awesome. [laughs] We're in the process of grading those, so stay tuned. We're hoping to get grades back for those before A two is due. 9 00:00:57,322 --> 00:01:04,121 Another reminder that your project proposals are due tomorrow. Actually, no, today at 11:59. 10 00:01:04,959 --> 00:01:09,074 Make sure you send those in. Details are on the website and on Piazza. 11 00:01:09,074 --> 00:01:15,269 Also a reminder, assignment two is already out. That'll be due a week from Thursday. 12 00:01:15,269 --> 00:01:25,860 Historically, assignment two has been the longest one in the class, so if you haven't started already on assignment two, I'd recommend you take a look at that pretty soon. 13 00:01:27,122 --> 00:01:32,484 Another reminder is that for assignment two, I think a lot of you will be using Google Cloud. 14 00:01:32,484 --> 00:01:38,586 Big reminder, make sure to stop your instances when you're not using them because whenever your instance is on, you get charged, 15 00:01:38,586 --> 00:01:42,899 and we only have so many coupons to distribute to you guys. 16 00:01:42,899 --> 00:01:52,223 Anytime your instance is on, even if you're not SSH'd into it, even if you're not actively running things in your Jupyter Notebook, any time that instance is on, you're going to be charged. 17 00:01:52,223 --> 00:01:57,118 Just make sure that you explicitly stop your instances when you're not using them. 18 00:01:57,118 --> 00:02:04,970 In this example, I've got a little screenshot of my dashboard on Google Cloud. I need to go in there and explicitly go to the dropdown and click stop. 19 00:02:04,970 --> 00:02:08,644 Just make sure that you do this when you're done working each day. 20 00:02:09,481 --> 00:02:20,853 Another thing to remember is that it's kind of up to you guys to keep track of your spending on Google Cloud. In particular, instances that use GPUs are a lot more expensive than those with CPUs. 21 00:02:20,853 --> 00:02:28,322 Rough order of magnitude, those GPU instances are around 90 cents to a dollar an hour. Those are actually quite pricey. 22 00:02:28,322 --> 00:02:39,739 The CPU instances are much cheaper. The general strategy is that you probably want to make two instances, one with a GPU and one without, and then only use that GPU instance when you really need the GPU. 23 00:02:39,739 --> 00:02:47,377 For example, for most of assignment two, you should only need the CPU, so you should only use your CPU instance for that. 24 00:02:47,377 --> 00:02:52,990 But then the final question, about TensorFlow or PyTorch, will need a GPU.
25 00:02:52,990 --> 00:02:58,897 This'll give you a little bit of practice with switching between multiple instances and only using that GPU when it's really necessary. 26 00:02:58,897 --> 00:03:04,307 Again, just kind of watch your spending. Try not to go too crazy on these things. 27 00:03:04,307 --> 00:03:07,748 Any questions on the administrative stuff before we move on? 28 00:03:11,180 --> 00:03:12,182 Question. 29 00:03:12,182 --> 00:03:13,902 - [Student] How much RAM should we use? 30 00:03:13,902 --> 00:03:16,133 - Question is how much RAM should we use? 31 00:03:16,133 --> 00:03:21,863 I think eight or 16 gigs is probably good for everything that you need in this class. 32 00:03:21,863 --> 00:03:27,114 As you scale up the number of CPUs and the amount of RAM, you also end up spending more money. 33 00:03:27,114 --> 00:03:34,542 If you stick with two or four CPUs and eight or 16 gigs of RAM, that should be plenty for all the homework-related stuff that you need to do. 34 00:03:36,636 --> 00:03:40,417 As a quick recap, last time we talked about activation functions. 35 00:03:40,417 --> 00:03:44,962 We talked about this whole zoo of different activation functions and some of their different properties. 36 00:03:44,962 --> 00:03:59,736 We saw that the sigmoid, which used to be quite popular when training neural networks maybe 10 years ago or so, has this problem with vanishing gradients near the two ends of the activation function. tanh has this similar sort of problem. 37 00:03:59,736 --> 00:04:09,230 Kind of the general recommendation is that you probably want to stick with ReLU for most cases as sort of a default choice 'cause it tends to work well for a lot of different architectures. 38 00:04:09,230 --> 00:04:16,820 We also talked about weight initialization. Remember that up on the top, we have this idea that when you initialize your weights 39 00:04:16,820 --> 00:04:23,787 at the start of training, if those weights are initialized to be too small, then the activations will vanish 40 00:04:23,788 --> 00:04:29,583 as you go through the network because as you multiply by these small numbers over and over again, they'll all sort of decay to zero. 41 00:04:29,583 --> 00:04:33,072 Then everything will be zero, learning won't happen, you'll be sad. 42 00:04:33,072 --> 00:04:41,208 On the other hand, if you initialize your weights too big, then as you go through the network and multiply by your weight matrix over and over again, eventually they'll explode. 43 00:04:41,208 --> 00:04:45,389 You'll be unhappy, there'll be no learning, it will be very bad. 44 00:04:45,389 --> 00:04:58,531 But if you get that initialization just right, for example, using the Xavier initialization or the MSRA initialization, then you kind of keep a nice distribution of activations as you go through the network. 45 00:04:58,531 --> 00:05:04,328 Remember that this kind of gets more and more important and more and more critical as your networks get deeper and deeper 46 00:05:04,328 --> 00:05:11,620 because as your network gets deeper, you're multiplying by those weight matrices over and over again, so you have more and more of these multiplicative terms. 47 00:05:11,620 --> 00:05:23,666 We also talked last time about data preprocessing. We talked about how it's pretty typical in conv nets to zero center and normalize your data so it has zero mean and unit variance. 48 00:05:23,666 --> 00:05:29,968 I wanted to provide a little bit of extra intuition about why you might actually want to do this.
49 00:05:29,968 --> 00:05:39,532 Imagine a simple setup where we have a binary classification problem where we want to draw a line to separate these red points from these blue points. 50 00:05:39,532 --> 00:05:46,948 On the left, you have this idea where if those data points are kind of not normalized and not centered and far away from the origin, 51 00:05:46,948 --> 00:05:55,007 then we can still use a line to separate them, but now if that line wiggles just a little bit, then our classification is going to get totally destroyed. 52 00:05:55,007 --> 00:06:05,992 That kind of means that in the example on the left, the loss function is now extremely sensitive to small perturbations in that linear classifier, in our weight matrix. 53 00:06:07,315 --> 00:06:14,554 We can still represent the same functions, but that might make learning quite difficult because, again, the loss is very sensitive 54 00:06:14,554 --> 00:06:25,351 to our parameter vector, whereas in the situation on the right, if you take that data cloud and you move it into the origin and you make it unit variance, then now, again, we can still 55 00:06:25,351 --> 00:06:35,523 classify that data quite well, but now as we wiggle that line a little bit, then our loss function is less sensitive to small perturbations in the parameter values. 56 00:06:35,523 --> 00:06:41,064 That maybe makes optimization a little bit easier, as we'll see a little bit going forward. 57 00:06:41,064 --> 00:06:46,539 By the way, this situation arises not only in the linear classification case. 58 00:06:46,539 --> 00:06:57,756 Inside a neural network, remember we kind of have these interleavings of these linear matrix multiplies, or convolutions, followed by non-linear activation functions. 59 00:06:59,078 --> 00:07:05,687 If the input to some layer in your neural network is not zero mean and not unit variance, then again, 60 00:07:05,687 --> 00:07:15,632 small perturbations in the weight matrix of that layer of the network could cause large perturbations in the output of that layer, which, again, might make learning difficult. 61 00:07:15,632 --> 00:07:20,481 This is kind of a little bit of extra intuition about why normalization might be important. 62 00:07:21,864 --> 00:07:26,862 Because we have this intuition that normalization is so important, we talked about batch normalization, 63 00:07:26,862 --> 00:07:36,030 which is where we just add this additional layer inside our networks to just force all of the intermediate activations to be zero mean and unit variance. 64 00:07:36,030 --> 00:07:41,465 I've sort of resummarized the batch normalization equations here with the shapes a little bit more explicitly. 65 00:07:41,465 --> 00:07:45,172 Hopefully this can help you out when you're implementing this thing on assignment two. 66 00:07:45,172 --> 00:07:59,254 But again, in batch normalization, we have this idea that in the forward pass, we use the statistics of the mini batch to compute a mean and a standard deviation, and then use those estimates to normalize our data on the forward pass. 67 00:07:59,254 --> 00:08:05,641 Then we also reintroduce the scale and shift parameters to increase the expressivity of the layer. 68 00:08:05,641 --> 00:08:09,990 You might want to refer back to this when working on assignment two. 69 00:08:09,990 --> 00:08:18,146 We also talked last time a little bit about babysitting the learning process, how you should probably be looking at your loss curves during training.
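For reference while working on assignment two, here is a minimal sketch of that batch normalization forward pass at training time. The function name, the small eps constant, and the (N, D) input layout are illustrative assumptions rather than the assignment's required interface, and the test-time behavior with running averages is omitted.

```python
import numpy as np

# Sketch of batch normalization at training time for an input x of shape (N, D).
# gamma and beta are the learned scale and shift parameters, each of shape (D,).
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature mean over the mini batch
    var = x.var(axis=0)                      # per-feature variance over the mini batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # reintroduce scale and shift
```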
70 00:08:18,146 --> 00:08:26,683 Here's an example of some loss curves from networks I was actually training over the weekend. This is usually my setup when I'm working on these things. 71 00:08:26,683 --> 00:08:35,795 On the left, I have some plot showing the training loss over time. You can see it's kind of going down, which means my network is reducing the loss. It's doing well. 72 00:08:35,795 --> 00:08:48,464 On the right, there's this plot where the X axis is, again, time, or the iteration number, and the Y axis is my performance measure both on my training set and on my validation set. 73 00:08:48,465 --> 00:08:58,680 You can see that as we go over time, then my training set performance goes up and up and up and up and up as my loss function goes down, but at some point, my validation set performance kind of plateaus. 74 00:08:58,680 --> 00:09:05,066 This kind of suggests that maybe I'm overfitting in this situation. Maybe I should have been trying to add additional regularization. 75 00:09:06,317 --> 00:09:09,504 We also talked a bit last time about hyperparameter search. 76 00:09:09,504 --> 00:09:14,798 All these networks have sort of a large zoo of hyperparameters. It's pretty important to set them correctly. 77 00:09:14,798 --> 00:09:20,725 We talked a little bit about grid search versus random search, and how random search is maybe a little bit nicer in theory 78 00:09:20,725 --> 00:09:30,669 because in the situation where your performance might be more sensitive with respect to one hyperparameter than another, random search lets you cover that space a little bit better. 79 00:09:30,669 --> 00:09:37,005 We also talked about the idea of coarse to fine search, where when you're doing this hyperparameter optimization, probably you 80 00:09:37,005 --> 00:09:43,408 want to start with very wide ranges for your hyperparameters, only train for a couple iterations, and then based on 81 00:09:43,408 --> 00:09:47,973 those results, you kind of narrow in on the range of hyperparameters that are good. 82 00:09:47,973 --> 00:09:51,666 Now, again, redo your search in a smaller range for more iterations. 83 00:09:51,666 --> 00:09:56,708 You can kind of iterate this process to kind of hone in on the right region for hyperparameters. 84 00:09:56,708 --> 00:10:04,455 But again, it's really important to, at the start, have a very coarse range to start with, where you want very, very wide ranges for all your hyperparameters. 85 00:10:04,455 --> 00:10:13,746 Ideally, those ranges should be so wide that your network is kind of blowing up at either end of the range so that you know that you've searched a wide enough range for those things. 86 00:10:17,462 --> 00:10:18,295 Question? 87 00:10:20,044 --> 00:10:26,672 - [Student] How many [speaks too low to hear] optimize at once? [speaks too low to hear] 88 00:10:31,840 --> 00:10:34,554 - The question is how many hyperparameters do we typically search at a time? 89 00:10:34,554 --> 00:10:38,244 Here it's two, but there are typically a lot more than two in these things. 90 00:10:38,244 --> 00:10:45,442 It kind of depends on the exact model and the exact architecture, but because the number of possibilities is exponential in the number of hyperparameters, 91 00:10:45,442 --> 00:10:48,012 you can't really test too many at a time. 92 00:10:48,012 --> 00:10:51,737 It also kind of depends on how many machines you have available. 93 00:10:51,737 --> 00:10:55,745 It kind of varies from person to person and from experiment to experiment.
94 00:10:55,745 --> 00:11:05,353 But generally, I try not to do this over more than maybe two or three or four at a time at most because, again, this exponential search just gets out of control. 95 00:11:05,353 --> 00:11:10,406 Typically, learning rate is the really important one that you need to nail first. 96 00:11:10,406 --> 00:11:19,542 Then other things, like regularization, like learning rate decay, model size, these other types of things tend to be a little bit less sensitive than learning rate. 97 00:11:19,542 --> 00:11:22,723 Sometimes you might do kind of a block coordinate descent, where you go and find 98 00:11:22,723 --> 00:11:27,459 the good learning rate, then you go back and try to look at different model sizes. 99 00:11:27,459 --> 00:11:30,759 This can help you cut down on the exponential search a little bit, 100 00:11:30,759 --> 00:11:35,370 but it's a little bit problem dependent on exactly which ones you should be searching over in which order. 101 00:11:36,253 --> 00:11:38,120 More questions? 102 00:11:38,120 --> 00:11:57,041 - [Student] [speaks too low to hear] Another parameter, but then changing that other parameter, two or three other parameters, makes it so that your learning rate or the ideal learning rate is still [speaks too low to hear]. 103 00:11:57,041 --> 00:12:04,537 - Question is how often does it happen where when you change one hyperparameter, then the other, the optimal values of the other hyperparameters change? 104 00:12:04,537 --> 00:12:11,339 That does happen sometimes, although for learning rates, that's typically less of a problem. 105 00:12:11,339 --> 00:12:18,130 For learning rates, typically you want to get in a good range, and then set it maybe even a little bit lower than optimal, and let it go for a long time. 106 00:12:18,130 --> 00:12:31,291 Then if you do that, combined with some of the fancier optimization strategies that we'll talk about today, then a lot of models tend to be a little bit less sensitive to learning rate once you get them in a good range. 107 00:12:31,291 --> 00:12:32,962 Sorry, did you have a question in front, as well? 108 00:12:32,962 --> 00:12:37,308 - [Student] [speaks too low to hear] 109 00:12:37,308 --> 00:12:41,292 - The question is what's wrong with having a small learning rate and increasing the number of epochs? 110 00:12:41,292 --> 00:12:45,139 The answer is that it might take a very long time. [laughs] 111 00:12:45,139 --> 00:12:48,383 - [Student] [speaks too low to hear] 112 00:12:48,383 --> 00:12:54,853 - Intuitively, if you set the learning rate very low and let it go for a very long time, then this should, in theory, always work. 113 00:12:54,853 --> 00:13:00,491 But in practice, those factors of 10 or 100 actually matter a lot when you're training these things. 114 00:13:00,491 --> 00:13:03,931 Maybe if you got the right learning rate, you could train it in six hours, 12 hours 115 00:13:03,931 --> 00:13:11,911 or a day, but then if you just were super safe and dropped it by a factor of 10 or by a factor of 100, now that one-day training becomes 100 days of training. 116 00:13:11,911 --> 00:13:16,400 That's three months. That's not going to be good. 117 00:13:16,400 --> 00:13:20,668 When you're taking these intro computer science classes, they always kind of sweep the constants under the rug, but when 118 00:13:20,668 --> 00:13:25,444 you're actually thinking about training things, those constants end up mattering a lot. 119 00:13:25,444 --> 00:13:26,861 Another question? 
120 00:13:27,877 --> 00:13:33,385 - [Student] If you have a low learning rate, [speaks too low to hear]. 121 00:13:33,385 --> 00:13:37,807 - Question is for a low learning rate, are you more likely to be stuck in local optima? 122 00:13:37,807 --> 00:13:42,601 I think that makes some intuitive sense, but in practice, that seems not to be much of a problem. 123 00:13:42,601 --> 00:13:47,030 I think we'll talk a bit more about that later today. 124 00:13:47,030 --> 00:13:53,151 Today I wanted to talk about a couple other really interesting and important topics when we're training neural networks. 125 00:13:53,151 --> 00:13:59,655 In particular, we've kind of alluded to fancier, more powerful optimization algorithms a couple of times. 126 00:13:59,655 --> 00:14:07,067 I wanted to spend some time today and really dig into those and talk about what are the actual optimization algorithms that most people are using these days. 127 00:14:07,067 --> 00:14:10,364 We also touched on regularization in earlier lectures. 128 00:14:10,364 --> 00:14:15,806 This is the concept of making your network do additional things to reduce the gap between train and test error. 129 00:14:15,806 --> 00:14:22,143 I wanted to talk about some more strategies that people are using in practice for regularization, with respect to neural networks. 130 00:14:22,143 --> 00:14:26,401 Finally, I also wanted to talk a bit about transfer learning, where you can 131 00:14:26,401 --> 00:14:31,490 sometimes get away with using less data than you think by transferring from one problem to another. 132 00:14:32,821 --> 00:14:39,885 If you recall from a few lectures ago, the kind of core strategy in training neural networks is an optimization problem 133 00:14:39,885 --> 00:14:50,982 where we write down some loss function which, for each value of the network weights, tells us how well or how poorly that value of the weights is doing on our problem. 134 00:14:50,982 --> 00:14:56,508 Then we imagine that this loss function gives us some nice landscape over the weights, 135 00:14:56,508 --> 00:15:04,142 where on the right, I've shown this maybe small, two-dimensional problem, where the X and Y axes are two values of the weights. 136 00:15:04,142 --> 00:15:07,984 Then the color of the plot kind of represents the value of the loss. 137 00:15:07,984 --> 00:15:15,195 In this kind of cartoon picture of a two-dimensional problem, we're only optimizing over these two values, W one, W two. 138 00:15:15,195 --> 00:15:23,203 The goal is to find the most red region in this case, which corresponds to the setting of the weights with the lowest loss. 139 00:15:23,203 --> 00:15:29,099 Remember, we've been working so far with this extremely simple optimization algorithm, stochastic gradient descent, 140 00:15:29,099 --> 00:15:32,393 where it's super simple, it's three lines. 141 00:15:32,393 --> 00:15:39,179 While true, we first evaluate the loss and the gradient on some mini batch of data. 142 00:15:39,179 --> 00:15:44,656 Then we step, updating our parameter vector in the negative direction of the gradient 143 00:15:44,656 --> 00:15:48,798 because this gives, again, the direction of greatest decrease of the loss function. 144 00:15:48,798 --> 00:15:56,282 Then we repeat this over and over again, and hopefully we converge to the red region and we get low error and we're very happy.
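As a rough sketch, that simple loop looks something like this; sample_minibatch and evaluate_gradient are hypothetical placeholders standing in for your data loading and loss/gradient computation, not real library functions.

```python
def sgd(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
        learning_rate=1e-2, num_iterations=1000):
    # Vanilla minibatch SGD: estimate the gradient on a small batch of data,
    # then step in the negative gradient direction.
    for _ in range(num_iterations):
        batch = sample_minibatch(training_data, batch_size=256)
        grad = evaluate_gradient(loss_fn, batch, weights)
        weights = weights - learning_rate * grad
    return weights
```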
145 00:15:56,282 --> 00:16:05,462 But unfortunately, this relatively simple optimization algorithm has quite a lot of problems that actually could come up in practice. 146 00:16:05,462 --> 00:16:08,713 One problem with stochastic gradient descent, 147 00:16:08,713 --> 00:16:18,969 imagine what happens if our objective function looks something like this, where, again, we're plotting two values, W one and W two. 148 00:16:18,969 --> 00:16:23,472 As we change one of those values, the loss function changes very slowly. 149 00:16:23,472 --> 00:16:26,687 As we change the horizontal value, then our loss changes slowly. 150 00:16:28,152 --> 00:16:34,930 As we go up and down in this landscape, now our loss is very sensitive to changes in the vertical direction. 151 00:16:34,930 --> 00:16:40,757 By the way, this is referred to as the loss having a bad condition number at this point, 152 00:16:40,757 --> 00:16:46,050 which is the ratio between the largest and smallest singular values of the Hessian matrix at that point. 153 00:16:46,050 --> 00:16:50,497 But the intuitive idea is that the loss landscape kind of looks like a taco shell. 154 00:16:50,497 --> 00:16:54,393 It's sort of very sensitive in one direction, not sensitive in the other direction. 155 00:16:54,393 --> 00:17:00,633 The question is what might SGD, stochastic gradient descent, do on a function that looks like this? 156 00:17:05,310 --> 00:17:12,196 If you run stochastic gradient descent on this type of function, you might get this characteristic zigzagging behavior, 157 00:17:12,197 --> 00:17:22,111 where because for this type of objective function, the direction of the gradient does not align with the direction towards the minima. 158 00:17:22,112 --> 00:17:29,335 When you compute the gradient and take a step, you might step sort of over this line and sort of zigzag back and forth. 159 00:17:29,335 --> 00:17:35,995 In effect, you get very slow progress along the horizontal dimension, which is the less sensitive dimension, and you get this 160 00:17:35,995 --> 00:17:41,551 zigzagging, nasty, nasty zigzagging behavior across the fast-changing dimension. 161 00:17:41,551 --> 00:17:50,139 This is undesirable behavior. By the way, this problem actually becomes much more common in high dimensions. 162 00:17:51,186 --> 00:18:00,617 In this kind of cartoon picture, we're only showing a two-dimensional optimization landscape, but in practice, our neural networks might have millions, tens of millions, hundreds of millions of parameters. 163 00:18:00,617 --> 00:18:14,221 That's hundreds of millions of directions along which this thing can move. Now among those hundreds of millions of different directions to move, if the ratio between the largest one and the smallest one is bad, then SGD will not perform so nicely. 164 00:18:14,221 --> 00:18:20,573 You can imagine that if we have 100 million parameters, probably the maximum ratio between those two will be quite large. 165 00:18:20,573 --> 00:18:26,398 I think this is actually quite a big problem in practice for many high-dimensional problems. 166 00:18:27,793 --> 00:18:33,564 Another problem with SGD has to do with this idea of local minima or saddle points. 167 00:18:33,564 --> 00:18:44,003 Here I've sort of swapped the graph a little bit. Now the X axis is showing the value of one parameter, and then the Y axis is showing the value of the loss. 168 00:18:44,003 --> 00:18:51,583 In this top example, we have kind of this curvy objective function, where there's a valley in the middle. 
169 00:18:51,583 --> 00:18:55,036 What happens to SGD in this situation? 170 00:18:55,036 --> 00:18:57,031 - [Student] [speaks too low to hear] 171 00:18:57,031 --> 00:19:04,454 - In this situation, SGD will get stuck because at this local minima, the gradient is zero because it's locally flat. 172 00:19:04,454 --> 00:19:09,194 Now remember with SGD, we compute the gradient and step in the direction of opposite gradient, 173 00:19:09,194 --> 00:19:15,862 so if at our current point, the opposite gradient is zero, then we're not going to make any progress, and we'll get stuck at this point. 174 00:19:15,862 --> 00:19:19,406 There's another problem with this idea of saddle points. 175 00:19:19,406 --> 00:19:26,140 Rather than being a local minima, you can imagine a point where in one direction we go up, and in the other direction we go down. 176 00:19:26,140 --> 00:19:28,953 Then at our current point, the gradient is zero. 177 00:19:28,953 --> 00:19:35,899 Again, in this situation, the function will get stuck at the saddle point because the gradient is zero. 178 00:19:35,899 --> 00:19:48,122 Although one thing I'd like to point out is that in one dimension, in a one-dimensional problem like this, local minima seem like a big problem and saddle points seem like kind of not something to worry about, but in fact, 179 00:19:48,122 --> 00:19:57,171 it's the opposite once you move to very high-dimensional problems because, again, if you think about you're in this 100 million dimensional space, what does a saddle point mean? 180 00:19:57,171 --> 00:20:03,135 That means that at my current point, some directions the loss goes up, and some directions the loss goes down. 181 00:20:03,135 --> 00:20:09,591 If you have 100 million dimensions, that's probably going to happen more frequently than, that's probably going to happen almost everywhere, basically. 182 00:20:09,591 --> 00:20:16,744 Whereas a local minima says that of all those 100 million directions that I can move, every one of them causes the loss to go up. 183 00:20:16,744 --> 00:20:22,316 In fact, that seems pretty rare when you're thinking about, again, these very high-dimensional problems. 184 00:20:23,270 --> 00:20:33,283 Really, the idea that has come to light in the last few years is that when you're training these very large neural networks, the problem is more about saddle points and less about local minima. 185 00:20:33,283 --> 00:20:40,140 By the way, this also is a problem not just exactly at the saddle point, but also near the saddle point. 186 00:20:40,140 --> 00:20:47,935 If you look at the example on the bottom, you see that in the regions around the saddle point, the gradient isn't zero, but the slope is very small. 187 00:20:47,935 --> 00:20:53,611 That means that if we're, again, just stepping in the direction of the gradient, and that gradient is very small, we're going to make 188 00:20:53,611 --> 00:21:01,872 very, very slow progress whenever our current parameter value is near a saddle point in the objective landscape. 189 00:21:01,872 --> 00:21:10,115 This is actually a big problem. Another problem with SGD comes from the S. 190 00:21:10,115 --> 00:21:13,521 Remember that SGD is stochastic gradient descent. 191 00:21:13,521 --> 00:21:20,586 Recall that our loss function is typically defined by computing the loss over many, many different examples. 192 00:21:20,586 --> 00:21:26,119 In this case, if N is your whole training set, then that could be something like a million. 
193 00:21:26,119 --> 00:21:29,347 Each time computing the loss would be very, very expensive. 194 00:21:29,347 --> 00:21:36,957 In practice, remember that we often estimate the loss and estimate the gradient using a small mini batch of examples. 195 00:21:36,957 --> 00:21:42,148 What this means is that we're not actually getting the true information about the gradient at every time step. 196 00:21:42,148 --> 00:21:46,773 Instead, we're just getting some noisy estimate of the gradient at our current point. 197 00:21:46,773 --> 00:21:50,575 Here on the right, I've kind of faked this plot a little bit. 198 00:21:50,575 --> 00:21:59,927 I've just added random uniform noise to the gradient at every point, and then run SGD with these noisy, messed up gradients. 199 00:21:59,927 --> 00:22:07,987 This is maybe not exactly what happens with the SGD process, but it still gives you the sense that if there's noise in your gradient estimates, then vanilla SGD 200 00:22:07,987 --> 00:22:14,036 kind of meanders around the space and might actually take a long time to get towards the minima. 201 00:22:15,723 --> 00:22:18,966 Now that we've talked about a lot of these problems. 202 00:22:18,966 --> 00:22:20,956 Sorry, was there a question? 203 00:22:20,956 --> 00:22:25,123 - [Student] [speaks too low to hear] 204 00:22:29,099 --> 00:22:34,435 - The question is do all of these just go away if we use normal gradient descent? 205 00:22:35,281 --> 00:22:44,106 Let's see. I think that the taco shell problem of high condition numbers is still a problem with full batch gradient descent. 206 00:22:44,106 --> 00:22:54,120 The noise. As we'll see, we might sometimes introduce additional noise into the network, not only due to sampling mini batches, but also due to explicit stochasticity in the network, 207 00:22:54,120 --> 00:22:57,736 so we'll see that later. That can still be a problem. 208 00:22:57,736 --> 00:23:05,101 Saddle points, that's still a problem for full batch gradient descent because there can still be saddle points in the full objective landscape. 209 00:23:05,101 --> 00:23:10,249 Basically, even if we go to full batch gradient descent, it doesn't really solve these problems. 210 00:23:10,249 --> 00:23:16,604 We kind of need to think about a slightly fancier optimization algorithm that can try to address these concerns. 211 00:23:16,604 --> 00:23:21,966 Thankfully, there's a really, really simple strategy that works pretty well at addressing many of these problems. 212 00:23:21,966 --> 00:23:26,978 That's this idea of adding a momentum term to our stochastic gradient descent. 213 00:23:26,978 --> 00:23:32,923 Here on the left, we have our classic old friend, SGD, where we just always step in the direction of the gradient. 214 00:23:32,923 --> 00:23:43,062 But now on the right, we have this minor, minor variant called SGD plus momentum, which is now two equations and five lines of code, so it's twice as complicated. 215 00:23:43,062 --> 00:23:51,331 But it's very simple. The idea is that we maintain a velocity over time, and we add our gradient estimates to the velocity. 216 00:23:51,331 --> 00:23:57,811 Then we step in the direction of the velocity, rather than stepping in the direction of the gradient. 217 00:23:57,811 --> 00:24:04,825 This is very, very simple. We also have this hyperparameter rho now which corresponds to friction.
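Written out in the same style as the SGD sketch above, those two equations look roughly like this; again, sample_minibatch and evaluate_gradient are hypothetical placeholders.

```python
def sgd_momentum(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
                 learning_rate=1e-2, rho=0.9, num_iterations=1000):
    # SGD with momentum: keep a running velocity, decay it by the friction
    # constant rho, add in the current gradient, then step along the velocity
    # instead of the raw gradient.
    velocity = 0
    for _ in range(num_iterations):
        batch = sample_minibatch(training_data, batch_size=256)
        grad = evaluate_gradient(loss_fn, batch, weights)
        velocity = rho * velocity + grad
        weights = weights - learning_rate * velocity
    return weights
```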
218 00:24:05,925 --> 00:24:16,848 Now at every time step, we take our current velocity, we decay the current velocity by the friction constant, rho, which is often something high, like .9 is a common choice. 219 00:24:16,848 --> 00:24:21,173 We take our current velocity, we decay it by friction and we add in our gradient. 220 00:24:21,173 --> 00:24:26,999 Now we step in the direction of our velocity vector, rather than the direction of our raw gradient vector. 221 00:24:28,327 --> 00:24:34,548 This super, super simple strategy actually helps for all of these problems that we just talked about. 222 00:24:34,548 --> 00:24:44,809 If you think about what happens at local minima or saddle points, then if we're imagining velocity in this system, then you kind of have this physical interpretation of this ball 223 00:24:44,809 --> 00:24:48,215 kind of rolling down the hill, picking up speed as it comes down. 224 00:24:48,215 --> 00:24:56,922 Now once we have velocity, then even when we pass that point of local minima, the point will still have velocity, even if it doesn't have gradient. 225 00:24:56,922 --> 00:25:01,039 Then we can hopefully get over this local minima and continue downward. 226 00:25:01,039 --> 00:25:03,809 There's this similar intuition near saddle points, 227 00:25:03,809 --> 00:25:10,734 where even though the gradient around the saddle point is very small, we have this velocity vector that we've built up as we roll downhill. 228 00:25:10,734 --> 00:25:16,462 That can hopefully carry us through the saddle point and let us continue rolling all the way down. 229 00:25:16,462 --> 00:25:21,949 If you think about what happens in poor conditioning, now if we were to have 230 00:25:21,949 --> 00:25:31,105 these kind of zigzagging approximations to the gradient, then those zigzags will hopefully cancel each other out pretty fast once we're using momentum. 231 00:25:31,105 --> 00:25:46,006 This will effectively reduce the amount by which we step in the sensitive direction, whereas in the horizontal direction, our velocity will just keep building up, and will actually accelerate our descent across that less sensitive dimension. 232 00:25:46,006 --> 00:25:51,344 Adding momentum here can actually help us with this high condition number problem, as well. 233 00:25:51,344 --> 00:26:05,207 Finally, on the right, we've repeated the same visualization of gradient descent with noise. Here, the black is this vanilla SGD, which is sort of zigzagging all over the place, where the blue line is showing now SGD with momentum. 234 00:26:05,207 --> 00:26:12,644 You can see that because we're building up this velocity over time, the noise kind of gets averaged out in our gradient estimates. 235 00:26:12,644 --> 00:26:20,337 Now SGD momentum ends up taking a much smoother path towards the minima, compared with vanilla SGD, which is kind of meandering due to noise. 236 00:26:20,337 --> 00:26:21,532 Question? 237 00:26:21,532 --> 00:26:25,699 - [Student] [speaks too low to hear] 238 00:26:34,776 --> 00:26:40,465 - The question is how does SGD momentum help with the poorly conditioned coordinate? 239 00:26:40,465 --> 00:26:49,125 The idea is that if you go back and look at this velocity estimate and look at the velocity computation, we're adding in the gradient at every time step.
240 00:26:49,125 --> 00:26:56,603 It kind of depends on your setting of rho, that hyperparameter, but you can imagine that if the gradient is relatively small, 241 00:26:56,603 --> 00:26:59,254 and if rho is well behaved in this situation, 242 00:26:59,254 --> 00:27:05,512 then our velocity could actually monotonically increase up to a point where the velocity could now be larger than the actual gradient. 243 00:27:05,512 --> 00:27:10,377 Then we might actually make faster progress along the poorly conditioned dimension. 244 00:27:12,569 --> 00:27:18,020 Kind of one picture that you can have in mind when we're doing SGD plus momentum is that 245 00:27:18,020 --> 00:27:20,273 the red here is our current point. 246 00:27:20,273 --> 00:27:30,075 At our current point, we have some red vector, which is the direction of the gradient, or rather our estimate of the gradient at the current point. Green is now the direction of our velocity vector. 247 00:27:30,075 --> 00:27:36,317 Now when we do the momentum update, we're actually stepping according to a weighted average of these two. 248 00:27:36,317 --> 00:27:40,049 This helps overcome some noise in our gradient estimate. 249 00:27:40,049 --> 00:27:47,724 There's a slight variation of momentum that you sometimes see, called Nesterov accelerated gradient, also sometimes called Nesterov momentum. 250 00:27:47,724 --> 00:27:51,737 That switches up this order of things a little bit. 251 00:27:51,737 --> 00:28:00,285 In sort of normal SGD momentum, we imagine that we estimate the gradient at our current point, and then take a mix of our velocity and our gradient. 252 00:28:00,285 --> 00:28:04,229 With Nesterov accelerated gradient, you do something a little bit different. 253 00:28:04,229 --> 00:28:10,765 Here, you start at the red point. You step in the direction of where the velocity would take you. 254 00:28:10,765 --> 00:28:18,732 You evaluate the gradient at that point. Then you go back to your original point and kind of mix together those two. 255 00:28:18,732 --> 00:28:25,679 This is kind of a funny interpretation, but you can imagine that you're kind of mixing together information a little bit more. 256 00:28:25,679 --> 00:28:34,702 If your velocity direction was actually a little bit wrong, it lets you incorporate gradient information from a little bit larger parts of the objective landscape. 257 00:28:34,702 --> 00:28:39,351 This also has some really nice theoretical properties when it comes to convex optimization, 258 00:28:39,351 --> 00:28:45,946 but those guarantees go a little bit out the window once it comes to non-convex problems like neural networks. 259 00:28:45,946 --> 00:28:51,061 Writing it down in equations, Nesterov momentum looks something like this, where now 260 00:28:51,061 --> 00:28:57,155 to update our velocity, we take a step, according to our previous velocity, and evaluate that gradient there. 261 00:28:57,155 --> 00:29:06,222 Now when we take our next step, we actually step in the direction of our velocity that's incorporating information from these multiple points. 262 00:29:06,222 --> 00:29:07,055 Question? 263 00:29:08,437 --> 00:29:12,357 - [Student] [speaks too low to hear] - Oh, sorry. 264 00:29:12,357 --> 00:29:14,743 The question is what's a good initialization for the velocity? 265 00:29:14,743 --> 00:29:16,998 This is almost always zero. 266 00:29:16,998 --> 00:29:20,096 It's not even a hyperparameter. Just set it to zero and don't worry. 267 00:29:20,096 --> 00:29:21,315 Another question? 
268 00:29:21,315 --> 00:29:25,482 - [Student] [speaks too low to hear] 269 00:29:31,992 --> 00:29:38,068 - Intuitively, the velocity is kind of a weighted sum of your gradients that you've seen over time. 270 00:29:38,068 --> 00:29:41,466 - [Student] [speaks too low to hear] 271 00:29:41,466 --> 00:29:44,027 - With more recent gradients being weighted heavier. 272 00:29:44,027 --> 00:29:49,716 At every time step, we take our old velocity, we decay by friction and we add in our current gradient. 273 00:29:49,716 --> 00:29:54,662 You can kind of think of this as a smooth moving average of your recent gradients 274 00:29:54,662 --> 00:30:00,109 with kind of an exponentially decaying weight on your gradients going back in time. 275 00:30:02,627 --> 00:30:11,632 This Nesterov formulation is a little bit annoying 'cause if you look at this, normally when you have your loss function, you want to evaluate your loss and your gradient at the same point. 276 00:30:11,632 --> 00:30:19,283 Nesterov breaks this a little bit. It's a little bit annoying to work with. Thankfully, there's a cute change of variables you can do. 277 00:30:19,283 --> 00:30:29,392 If you do the change of variables and reshuffle a little bit, then you can write Nesterov momentum in a slightly different way that now, again, lets you evaluate the loss and the gradient at the same point always. 278 00:30:29,392 --> 00:30:34,093 Once you make this change of variables, you get kind of a nice interpretation of Nesterov, 279 00:30:34,093 --> 00:30:41,739 which is that here in the first step, this looks exactly like updating the velocity in the vanilla SGD momentum case, where we 280 00:30:41,739 --> 00:30:48,178 have our current velocity, we evaluate the gradient at the current point and mix these two together in a decaying way. 281 00:30:48,178 --> 00:30:51,951 Now in the second update, now when we're actually updating our parameter vector, if you look 282 00:30:51,951 --> 00:30:57,592 at the second equation, we have our current point plus our current velocity plus 283 00:30:57,592 --> 00:31:01,454 a weighted difference between our current velocity and our previous velocity. 284 00:31:01,454 --> 00:31:11,271 Here, Nesterov momentum is kind of incorporating some kind of error-correcting term between your current velocity and your previous velocity. 285 00:31:13,029 --> 00:31:25,249 If we look at SGD, SGD momentum and Nesterov momentum on this kind of simple problem, we notice that SGD, in black, kind of takes this slow progress toward the minima. 286 00:31:26,346 --> 00:31:29,598 The blue and the green show momentum and Nesterov. 287 00:31:29,598 --> 00:31:36,803 These have this behavior of kind of overshooting the minimum 'cause they're building up velocity going past the minimum, and then kind of 288 00:31:36,803 --> 00:31:39,849 correcting themselves and coming back towards the minima. 289 00:31:39,849 --> 00:31:40,682 Question? 290 00:31:42,023 --> 00:31:46,190 - [Student] [speaks too low to hear] 291 00:31:52,024 --> 00:31:58,050 - The question is this picture looks good, but what happens if your minimum lies in this very narrow basin? 292 00:31:58,050 --> 00:32:01,527 Will the velocity just cause you to skip right over that minimum? 293 00:32:01,527 --> 00:32:05,232 That's actually a really interesting point, and the subject of some recent theoretical work, 294 00:32:05,232 --> 00:32:09,071 but the idea is that maybe those really sharp minima are actually bad minima.
295 00:32:09,071 --> 00:32:17,601 We don't want to even land in those 'cause the idea is that maybe if you have a very sharp minima, that actually could be a minima that overfits more. 296 00:32:17,601 --> 00:32:22,026 If you imagine that we doubled our training set, the whole optimization landscape would change, 297 00:32:22,026 --> 00:32:27,420 and maybe that very sensitive minima would actually disappear if we were to collect more training data. 298 00:32:27,420 --> 00:32:31,189 We kind of have this intuition that we maybe want to land in very flat minima 299 00:32:31,189 --> 00:32:35,933 because those very flat minima are probably more robust as we change the training data. 300 00:32:35,933 --> 00:32:40,453 Those flat minima might actually generalize better to testing data. 301 00:32:40,453 --> 00:32:46,284 This is again, sort of very recent theoretical work, but that's actually a really good point to bring up. 302 00:32:46,284 --> 00:32:54,354 In some sense, it's actually a feature and not a bug that SGD momentum actually skips over those very sharp minima. 303 00:32:55,979 --> 00:32:59,979 That's actually a good thing, believe it or not. 304 00:33:00,825 --> 00:33:04,316 Another thing you can see is if you look at the difference between momentum and Nesterov here, 305 00:33:04,316 --> 00:33:12,715 you can see that because of the correction factor in Nesterov, maybe it's not overshooting quite as drastically, compared to vanilla momentum. 306 00:33:14,683 --> 00:33:20,068 Another kind of common optimization strategy is this algorithm called AdaGrad, 307 00:33:20,068 --> 00:33:25,292 which John Duchi, who's now a professor here, worked on during his Ph.D. 308 00:33:25,292 --> 00:33:37,663 The idea with AdaGrad is that during the course of the optimization, you're going to keep a running estimate or a running sum of all the squared gradients that you see during training. 309 00:33:39,569 --> 00:33:43,957 Now rather than having a velocity term, instead we have this grad squared term. 310 00:33:43,957 --> 00:33:49,199 During training, we're going to just keep adding the squared gradients to this grad squared term. 311 00:33:49,199 --> 00:33:57,449 Now when we update our parameter vector, we'll divide by this grad squared term when we're making our update step. 312 00:33:59,334 --> 00:34:07,261 The question is what does this kind of scaling do in this situation where we have a very high condition number? 313 00:34:08,393 --> 00:34:12,560 - [Student] [speaks too low to hear] 314 00:34:16,256 --> 00:34:22,904 - The idea is that if we have two coordinates, one that always has a very high gradient and one that always has a very small gradient, 315 00:34:22,904 --> 00:34:35,181 then as we add the sum of the squares of the small gradient, we're going to be dividing by a small number, so we'll accelerate movement along the slow dimension, along the one dimension. 316 00:34:35,181 --> 00:34:45,924 Then along the other dimension, where the gradients tend to be very large, then we'll be dividing by a large number, so we'll kind of slow down our progress along the wiggling dimension. 317 00:34:45,924 --> 00:34:56,093 But there's kind of a problem here. That's the question of what happens with AdaGrad over the course of training, as t gets larger and larger and larger?
318 00:34:56,094 --> 00:34:58,391 - [Student] [speaks too low to hear] 319 00:34:58,391 --> 00:35:02,239 - With AdaGrad, the steps actually get smaller and smaller and smaller because we just 320 00:35:02,239 --> 00:35:09,895 continue updating this estimate of the squared gradients over time, so this estimate just grows and grows and grows monotonically over the course of training. 321 00:35:09,895 --> 00:35:15,359 Now this causes our step size to get smaller and smaller and smaller over time. 322 00:35:15,359 --> 00:35:20,334 Again, in the convex case, there's some really nice theory showing that this is actually 323 00:35:20,334 --> 00:35:28,125 really good 'cause in the convex case, as you approach a minimum, you kind of want to slow down so you actually converge. 324 00:35:28,125 --> 00:35:31,192 That's actually kind of a feature in the convex case. 325 00:35:31,192 --> 00:35:42,007 But in the non-convex case, that's a little bit problematic because as you come towards a saddle point, you might get stuck with AdaGrad, and then you kind of no longer make any progress. 326 00:35:42,007 --> 00:35:48,678 There's a slight variation of AdaGrad, called RMSProp, that actually addresses this concern a little bit. 327 00:35:48,678 --> 00:35:53,390 Now with RMSProp, we still keep this estimate of the squared gradients, but instead 328 00:35:53,390 --> 00:36:01,085 of just letting that squared estimate continually accumulate over training, instead, we let that squared estimate actually decay. 329 00:36:01,085 --> 00:36:09,340 This ends up looking kind of like a momentum update, except we're having kind of momentum over the squared gradients, rather than momentum over the actual gradients. 330 00:36:09,340 --> 00:36:20,361 Now with RMSProp, after we compute our gradient, we take our current estimate of the grad squared, we multiply it by this decay rate, which is commonly something like .9 or .99. 331 00:36:20,361 --> 00:36:26,601 Then we add in one minus the decay rate times our current squared gradient. 332 00:36:26,601 --> 00:36:37,193 Now over time, you can imagine that this estimate leaks away the older gradients. Then again, when we make our step, the step looks exactly the same as AdaGrad, where we divide by the squared gradient 333 00:36:37,193 --> 00:36:44,070 in the step to again have this nice property of accelerating movement along the one dimension, and slowing down movement along the other dimension. 334 00:36:44,070 --> 00:36:52,411 But now with RMSProp, because these estimates are leaky, then it kind of addresses the problem of maybe always slowing down where you might not want to. 335 00:36:56,455 --> 00:37:04,173 Here again, we're kind of showing our favorite toy problem with SGD in black, SGD momentum in blue and RMSProp in red. 336 00:37:04,173 --> 00:37:12,263 You can see that RMSProp and SGD momentum are both doing much better than SGD, but their qualitative behavior is a little bit different. 337 00:37:12,263 --> 00:37:17,488 With SGD momentum, it kind of overshoots the minimum and comes back, whereas with 338 00:37:17,488 --> 00:37:26,392 RMSProp, it's kind of adjusting its trajectory such that we're making approximately equal progress among all the dimensions. 339 00:37:26,392 --> 00:37:34,412 By the way, you can't actually tell, but this plot is also showing AdaGrad in green with the same learning rate, but it just 340 00:37:34,412 --> 00:37:38,606 gets stuck due to this problem of continually decaying learning rates.
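As a rough sketch, here are the AdaGrad and RMSProp updates side by side, in the same style as the earlier snippets; the helpers are the same hypothetical placeholders, and the 1e-7 in the denominator is just a small constant to avoid dividing by zero.

```python
import numpy as np

def adagrad(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
            learning_rate=1e-2, num_iterations=1000):
    # AdaGrad: the running sum of squared gradients only ever grows,
    # so the effective step size keeps shrinking over training.
    grad_squared = 0
    for _ in range(num_iterations):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        grad_squared = grad_squared + grad * grad
        weights = weights - learning_rate * grad / (np.sqrt(grad_squared) + 1e-7)
    return weights

def rmsprop(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
            learning_rate=1e-2, decay_rate=0.99, num_iterations=1000):
    # RMSProp: same idea, but the squared-gradient estimate is leaky (it decays),
    # so progress doesn't grind to a halt the way it can with AdaGrad.
    grad_squared = 0
    for _ in range(num_iterations):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
        weights = weights - learning_rate * grad / (np.sqrt(grad_squared) + 1e-7)
    return weights
```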
341 00:37:38,606 --> 00:37:45,553 In practice, AdaGrad is maybe not so common for many of these things. That's a little bit of an unfair comparison of AdaGrad. 342 00:37:46,392 --> 00:37:52,558 Probably you need to increase the learning rate with AdaGrad, and then it would end up looking kind of like RMSProp in this case. 343 00:37:52,558 --> 00:37:57,148 But in general, we tend not to use AdaGrad so much when training neural networks. 344 00:37:57,148 --> 00:37:57,981 Question? 345 00:37:57,981 --> 00:37:59,796 - [Student] [speaks too low to hear] 346 00:37:59,796 --> 00:38:04,387 - The answer is yes, this problem is convex, but in this case, 347 00:38:07,146 --> 00:38:12,003 it's a little bit of an unfair comparison because the learning rates are not so comparable among the methods. 348 00:38:12,003 --> 00:38:17,290 I've been a little bit unfair to AdaGrad in this visualization by showing the same learning rate between the different algorithms, 349 00:38:17,290 --> 00:38:23,132 when probably you should have separately tuned the learning rates per algorithm. 350 00:38:27,970 --> 00:38:35,403 We saw in momentum, we had this idea of velocity, where we're building up velocity by adding in the gradients, and then stepping in the direction of the velocity. 351 00:38:35,403 --> 00:38:43,744 We saw with AdaGrad and RMSProp that we had this other idea, of building up an estimate of the squared gradients, and then dividing by the squared gradients. 352 00:38:43,744 --> 00:38:46,658 Then these both seem like good ideas on their own. 353 00:38:46,658 --> 00:38:50,898 Why don't we just stick 'em together and use them both? Maybe that would be even better. 354 00:38:50,898 --> 00:38:56,741 That brings us to this algorithm called Adam, or rather brings us very close to Adam. 355 00:38:56,741 --> 00:39:01,119 We'll see in a couple slides that there's a slight correction we need to make here. 356 00:39:01,119 --> 00:39:06,888 Here with Adam, we maintain an estimate of the first moment and the second moment. 357 00:39:06,888 --> 00:39:14,667 Now in the red, we make this estimate of the first moment as a weighted sum of our gradients. 358 00:39:14,667 --> 00:39:22,741 We have this moving estimate of the second moment, like AdaGrad and like RMSProp, which is a moving estimate of our squared gradients. 359 00:39:22,741 --> 00:39:28,621 Now when we make our update step, we step using both the first moment, which is kind of our 360 00:39:28,621 --> 00:39:37,281 velocity, and also divide by the second moment, or rather the square root of the second moment, which is this squared gradient term. 361 00:39:38,128 --> 00:39:46,269 This idea of Adam ends up looking a little bit like RMSProp plus momentum, or ends up looking like momentum plus squared gradients. 362 00:39:46,269 --> 00:39:51,989 It kind of incorporates the nice properties of both. But there's a little bit of a problem here. 363 00:39:51,989 --> 00:40:06,134 That's the question of what happens at the very first time step? At the very first time step, you can see that at the beginning, we've initialized our second moment with zero. 364 00:40:06,134 --> 00:40:16,803 Now after one update of the second moment, typically this beta two, second moment decay rate, is something like .9 or .99, something very close to one. 365 00:40:18,235 --> 00:40:22,867 After one update, our second moment is still very, very close to zero.
366 00:40:22,867 --> 00:40:32,377 Now when we're making our update step here and we divide by our second moment, now we're dividing by a very small number. We're making a very, very large step at the beginning. 367 00:40:32,377 --> 00:40:37,768 This very, very large step at the beginning is not really due to the geometry of the problem. 368 00:40:37,768 --> 00:40:43,422 It's kind of an artifact of the fact that we initialized our second moment estimate to zero. 369 00:40:43,422 --> 00:40:44,322 Question? 370 00:40:44,322 --> 00:40:48,489 - [Student] [speaks too low to hear] 371 00:40:52,832 --> 00:41:00,365 - That's true. The comment is that if your first moment is also very small, then you're multiplying by small and you're dividing by square root 372 00:41:00,365 --> 00:41:02,906 of small squared, so what's going to happen? 373 00:41:02,906 --> 00:41:05,746 They might cancel each other out, you might be okay. 374 00:41:05,746 --> 00:41:13,632 That's true. Sometimes these cancel each other out and you're okay, but sometimes this ends up taking very large steps right at the beginning. 375 00:41:13,632 --> 00:41:16,245 That can be quite bad. 376 00:41:16,245 --> 00:41:19,533 Maybe you initialize a little bit poorly. You take a very large step. 377 00:41:19,533 --> 00:41:26,145 Now your initialization is completely messed up, and then you're in a very bad part of the objective landscape and you just can't converge from there. 378 00:41:26,145 --> 00:41:27,165 Question? 379 00:41:27,165 --> 00:41:30,915 - [Student] [speaks too low to hear] 380 00:41:30,915 --> 00:41:37,847 - The question is what is this 10 to the minus seven term in the last equation? That's actually appeared in AdaGrad, RMSProp and Adam. 381 00:41:37,847 --> 00:41:42,187 The idea is that we're dividing by something. We want to make sure we're not dividing by zero, 382 00:41:42,187 --> 00:41:48,609 so we always add a small positive constant to the denominator, just to make sure we're not dividing by zero. 383 00:41:48,609 --> 00:41:56,012 That's technically a hyperparameter, but it tends not to matter too much, so just setting it to 10 to the minus seven, 10 to the minus eight, something like that, tends to work well. 384 00:41:57,967 --> 00:42:05,511 With Adam, remember we just talked about this idea that at the first couple of steps, we might take very large steps and mess ourselves up. 385 00:42:05,511 --> 00:42:12,510 Adam also adds this bias correction term to avoid this problem of taking very large steps at the beginning. 386 00:42:12,510 --> 00:42:22,619 You can see that after we update our first and second moments, we create an unbiased estimate of those first and second moments by incorporating the current time step, t. 387 00:42:22,619 --> 00:42:29,550 Now we actually make our step using these unbiased estimates, rather than the original first and second moment estimates. 388 00:42:29,550 --> 00:42:33,167 This gives us our full form of Adam. 389 00:42:33,167 --> 00:42:45,550 By the way, Adam is a really, [laughs] really good optimization algorithm, and it works really well for a lot of different problems, so that's kind of my default optimization algorithm for just about any new problem that I'm tackling. 390 00:42:45,550 --> 00:42:53,088 In particular, if you set beta one equals .9, beta two equals .999, learning rate one e minus three or five e minus four, 391 00:42:53,088 --> 00:42:58,797 that's a great starting point for just about all the architectures I've ever worked with.
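Putting the pieces together, here is a minimal sketch of the full Adam update with the bias correction, using those default hyperparameters; as before, the data and gradient helpers are hypothetical placeholders, not a real library API.

```python
import numpy as np

def adam(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
         learning_rate=1e-3, beta1=0.9, beta2=0.999, num_iterations=1000):
    # Adam: momentum-like first moment, RMSProp-like second moment, plus the
    # bias correction that avoids huge steps in the first few iterations.
    first_moment, second_moment = 0, 0
    for t in range(1, num_iterations + 1):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        first_moment = beta1 * first_moment + (1 - beta1) * grad
        second_moment = beta2 * second_moment + (1 - beta2) * grad * grad
        first_unbias = first_moment / (1 - beta1 ** t)     # bias correction
        second_unbias = second_moment / (1 - beta2 ** t)   # bias correction
        weights = weights - learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    return weights
```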
392 00:42:58,797 --> 00:43:03,518 Try that. That's a really good place to start, in general. 393 00:43:03,518 --> 00:43:05,949 [laughs] 394 00:43:05,949 --> 00:43:11,634 If we actually plot these things out and look at SGD, SGD momentum, RMSProp and Adam on the same problem, 395 00:43:11,634 --> 00:43:18,094 you can see that Adam, in the purple here, kind of combines elements of both SGD momentum and RMSProp. 396 00:43:18,094 --> 00:43:25,175 Adam kind of overshoots the minimum a little bit like SGD momentum, but it doesn't overshoot quite as much as momentum. 397 00:43:25,175 --> 00:43:33,268 Adam also has this similar behavior of RMSProp of kind of trying to curve to make equal progress along all dimensions. 398 00:43:33,268 --> 00:43:37,706 Maybe in this small two-dimensional example, Adam converged about the same as the other ones, 399 00:43:37,706 --> 00:43:42,832 but you can see qualitatively that it's kind of combining the behaviors of both momentum and RMSProp. 400 00:43:45,042 --> 00:43:48,709 Any questions about optimization algorithms? 401 00:43:50,048 --> 00:43:56,606 - [Student] [speaks too low to hear] They still take a very long time to train. [speaks too low to hear] 402 00:43:56,606 --> 00:44:03,193 - The question is what does Adam not fix? Well, these neural networks are still large, and they still take a long time to train. 403 00:44:04,744 --> 00:44:07,098 There can still be a problem. 404 00:44:07,098 --> 00:44:11,979 In this picture where we have this landscape of things looking like ovals, you can imagine 405 00:44:11,979 --> 00:44:19,219 that we're kind of making estimates along each dimension independently to allow us to speed up or slow down along different 406 00:44:19,219 --> 00:44:26,576 coordinate axes, but one problem is that if that taco shell is kind of tilted and is not axis aligned, then we're still 407 00:44:26,576 --> 00:44:29,887 only making estimates along the individual axes independently. 408 00:44:30,935 --> 00:44:38,131 That corresponds to taking your rotated taco shell and squishing it horizontally and vertically, but you can't actually unrotate it. 409 00:44:38,131 --> 00:44:48,732 In cases where you have this kind of rotated picture of poor conditioning, then Adam or any of these other algorithms really can't address that concern. 410 00:44:51,356 --> 00:44:57,706 Another thing that we've seen in all these optimization algorithms is learning rate as a hyperparameter. 411 00:44:57,706 --> 00:45:01,828 We've seen this picture before a couple times, that as you use different learning rates, 412 00:45:01,828 --> 00:45:05,097 sometimes if it's too high, it might explode in the yellow. 413 00:45:05,097 --> 00:45:09,629 If it's a very low learning rate, in the blue, it might take a very long time to converge. 414 00:45:09,629 --> 00:45:11,933 It's kind of tricky to pick the right learning rate. 415 00:45:13,712 --> 00:45:19,308 This is a little bit of a trick question because we don't actually have to stick with one learning rate throughout the course of training. 416 00:45:19,308 --> 00:45:29,705 Sometimes you'll see people decay the learning rates over time, where we can kind of combine the effects of these different curves on the left, and get the nice properties of each. 417 00:45:29,705 --> 00:45:39,366 Sometimes you'll start with a higher learning rate near the start of training, and then decay the learning rate and make it smaller and smaller throughout the course of training.
418 00:45:39,366 --> 00:45:46,795 A couple of strategies for this would be a step decay, where at the 100,000th iteration, you just decay by some factor and you keep going. 419 00:45:46,795 --> 00:45:52,579 You might see an exponential decay, where you continually decay during training. 420 00:45:52,579 --> 00:45:57,598 You might see different variations of continually decaying the learning rate during training. 421 00:45:57,598 --> 00:46:04,347 If you look at papers, especially the ResNet paper, you often see plots that look kind of like this, where the loss is kind of going down, 422 00:46:04,347 --> 00:46:07,898 then dropping, then flattening again, then dropping again. 423 00:46:07,898 --> 00:46:11,312 What's going on in these plots is that they're using a step decay learning rate, 424 00:46:11,312 --> 00:46:18,401 where at these parts where it plateaus and then suddenly drops again, those are the iterations where they dropped the learning rate by some factor. 425 00:46:18,401 --> 00:46:26,243 The idea of dropping the learning rate is that you might imagine the model got near some good region, but now the gradients are getting smaller, and 426 00:46:26,243 --> 00:46:28,066 it's kind of bouncing around too much. 427 00:46:28,066 --> 00:46:32,745 Then if we drop the learning rate, it lets it slow down and continue to make progress down the landscape. 428 00:46:32,745 --> 00:46:36,475 This tends to help in practice sometimes. 429 00:46:36,475 --> 00:46:44,973 Although one thing to point out is that learning rate decay is a little bit more common with SGD momentum, and a little bit less common with something like Adam. 430 00:46:44,973 --> 00:46:50,458 Another thing I'd like to point out is that learning rate decay is kind of a second-order hyperparameter. 431 00:46:50,458 --> 00:46:53,324 You typically should not optimize over this thing from the start. 432 00:46:53,324 --> 00:47:00,877 Usually when you're kind of getting networks to work at the beginning, you want to pick a good learning rate with no learning rate decay from the start. 433 00:47:00,877 --> 00:47:06,068 Trying to cross-validate jointly over learning rate decay and initial learning rate and other things, you'll just get confused. 434 00:47:06,068 --> 00:47:10,581 What you do for setting learning rate decay is try with no decay, see what happens. 435 00:47:10,581 --> 00:47:15,427 Then kind of eyeball the loss curve and see where you think you might need decay. 436 00:47:16,860 --> 00:47:24,948 Another thing I wanted to mention briefly is this idea that all these algorithms we've talked about are first-order optimization algorithms. 437 00:47:24,948 --> 00:47:33,064 In this one-dimensional picture, we have this kind of curvy objective function, and our current point is in red. 438 00:47:33,064 --> 00:47:36,057 What we're basically doing is computing the gradient at that point. 439 00:47:36,057 --> 00:47:40,722 We're using the gradient information to compute some linear approximation to our function, 440 00:47:40,722 --> 00:47:44,208 which is kind of a first-order Taylor approximation to our function. 441 00:47:44,208 --> 00:47:51,814 Now we pretend that the first-order approximation is our actual function, and we make a step to try to minimize the approximation. 442 00:47:51,814 --> 00:47:57,353 But this approximation doesn't hold for very large regions, so we can't step too far in that direction.
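[Before going further, here is a minimal sketch of the two learning rate decay schedules mentioned a moment ago. The specific constants, like decaying by 10x every 100,000 iterations or the exponential rate k, are placeholder choices for illustration, not recommendations from the lecture.]

import numpy as np

def step_decay(lr0, t, drop=0.1, every=100000):
    # multiply the initial learning rate by `drop` after every `every` iterations
    return lr0 * (drop ** (t // every))

def exponential_decay(lr0, t, k=1e-5):
    # continuously shrink the learning rate: lr = lr0 * exp(-k * t)
    return lr0 * np.exp(-k * t)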
443 00:47:57,353 --> 00:48:04,509 But really, the idea here is that we're only incorporating information about the first derivative of the function. You can actually go a little bit fancier. 444 00:48:04,509 --> 00:48:11,248 There's this idea of second-order approximation, where we take into account both first derivative and second derivative information. 445 00:48:11,248 --> 00:48:18,449 Now we make a second-order Taylor approximation to our function and kind of locally approximate our function with a quadratic. 446 00:48:18,449 --> 00:48:22,281 Now with a quadratic, you can step right to the minimum, and you're really happy. 447 00:48:22,281 --> 00:48:25,769 That's this idea of second-order optimization. 448 00:48:25,769 --> 00:48:30,489 When you generalize this to multiple dimensions, you get something called the Newton step, 449 00:48:30,489 --> 00:48:35,066 where you compute this Hessian matrix, which is a matrix of second derivatives, 450 00:48:35,066 --> 00:48:43,689 and you end up inverting this Hessian matrix in order to step directly to the minimum of this quadratic approximation to your function. 451 00:48:43,689 --> 00:48:48,910 Does anyone spot something that's quite different about this update rule, compared to the other ones that we've seen? 452 00:48:48,910 --> 00:48:51,107 - [Student] [speaks too low to hear] 453 00:48:51,107 --> 00:48:54,328 - This doesn't have a learning rate. That's kind of cool. 454 00:48:56,463 --> 00:49:00,664 We're making this quadratic approximation and we're stepping right to the minimum of the quadratic. 455 00:49:00,664 --> 00:49:04,681 At least in this vanilla version of Newton's method, you don't actually need a learning rate. 456 00:49:04,681 --> 00:49:07,849 You just always step to the minimum at every time step. 457 00:49:07,849 --> 00:49:13,265 However, in practice, you might end up, have a learning rate anyway because, again, that quadratic approximation might not be perfect, 458 00:49:13,265 --> 00:49:21,055 so you might only want to step in the direction towards the minimum, rather than actually stepping to the minimum, but at least in this vanilla version, it doesn't have a learning rate. 459 00:49:23,994 --> 00:49:27,366 But unfortunately, this is maybe a little bit impractical for deep learning 460 00:49:27,366 --> 00:49:34,519 because this Hessian matrix is N by N, where N is the number of parameters in your network. 461 00:49:34,519 --> 00:49:38,498 If N is 100 million, then 100 million squared is way too big. 462 00:49:38,498 --> 00:49:42,046 You definitely can't store that in memory, and you definitely can't invert it. 463 00:49:42,046 --> 00:49:46,486 In practice, people sometimes use these quasi-Newton methods that, rather than working 464 00:49:46,486 --> 00:49:52,725 with the full Hessian and inverting the full Hessian, they work with approximations. Low-rank approximations are common. 465 00:49:52,725 --> 00:49:57,092 You'll sometimes see these for some problems. 466 00:49:57,092 --> 00:50:03,487 L-BFGS is one particular second-order optimizer that has this approximate second, keeps this approximation of the Hessian 467 00:50:03,487 --> 00:50:11,205 that you'll sometimes see, but in practice, it doesn't work too well for many deep learning problems because these approximations, 468 00:50:11,205 --> 00:50:16,410 these second-order approximations, don't really handle the stochastic case very much, very nicely. 469 00:50:16,410 --> 00:50:20,616 They also tend not to work so well with non-convex problems. 
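[For reference, a tiny sketch of the vanilla Newton update described above. Note that there is no learning rate, and that for a network with N parameters the Hessian below would be N by N, which is exactly what makes this impractical for deep learning; grad_fn and hessian_fn are stand-ins for functions that return the gradient and Hessian.]

import numpy as np

def newton_step(x, grad_fn, hessian_fn):
    g = grad_fn(x)       # gradient, shape (N,)
    H = hessian_fn(x)    # Hessian, shape (N, N) -- already far too big to form for a deep net
    # step straight to the minimum of the local quadratic approximation:
    # x_new = x - H^{-1} g  (solve a linear system rather than explicitly inverting H)
    return x - np.linalg.solve(H, g)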
470 00:50:20,616 --> 00:50:23,142 I don't want to get into that right now too much. 471 00:50:23,142 --> 00:50:29,022 In practice, what you should really do is probably Adam is a really good choice for many different neural network things, 472 00:50:29,022 --> 00:50:38,974 but if you're in a situation where you can afford to do full batch updates, and you know that your problem doesn't have really any stochasticity, then L-BFGS is kind of a good choice. 473 00:50:38,974 --> 00:50:43,181 L-BFGS doesn't really get used for training neural networks too much, but as we'll see 474 00:50:43,181 --> 00:50:47,251 in a couple of lectures, it does sometimes get used for things like style transfer, 475 00:50:47,251 --> 00:50:54,356 where you actually have less stochasticity and fewer parameters, but you still want to solve an optimization problem. 476 00:50:55,834 --> 00:51:00,992 All of these strategies we've talked about so far are about reducing training error. 477 00:51:02,344 --> 00:51:07,452 All these optimization algorithms are really about driving down your training error and minimizing your objective function, 478 00:51:07,452 --> 00:51:10,403 but we don't really care about training error that much. 479 00:51:10,403 --> 00:51:13,203 Instead, we really care about our performance on unseen data. 480 00:51:13,203 --> 00:51:16,817 We really care about reducing this gap between train and test error. 481 00:51:16,817 --> 00:51:21,228 The question is once we're already good at optimizing our objective function, 482 00:51:21,228 --> 00:51:25,535 what can we do to try to reduce this gap and make our model perform better on unseen data? 483 00:51:28,497 --> 00:51:33,617 One really quick and dirty, easy thing to try is this idea of model ensembles 484 00:51:33,617 --> 00:51:36,767 that sometimes works across many different areas in machine learning. 485 00:51:36,767 --> 00:51:44,588 The idea is pretty simple. Rather than having just one model, we'll train 10 different models independently from different initial random restarts. 486 00:51:44,588 --> 00:51:51,333 Now at test time, we'll run our data through all of the 10 models and average the predictions of those 10 models. 487 00:51:53,562 --> 00:52:01,555 Adding these multiple models together tends to reduce overfitting a little bit and tend to improve performance a little bit, typically by a couple percent. 488 00:52:01,555 --> 00:52:05,302 This is generally not a drastic improvement, but it is a consistent improvement. 489 00:52:05,302 --> 00:52:13,263 You'll see that in competitions, like ImageNet and other things like that, using model ensembles is very common to get maximal performance. 490 00:52:14,488 --> 00:52:20,482 You can actually get a little bit creative with this. Sometimes rather than training separate models independently, you can just keep multiple 491 00:52:20,482 --> 00:52:25,928 snapshots of your model during the course of training, and then use these as your ensembles. 492 00:52:25,928 --> 00:52:29,804 Then you still, at test time, need to average the predictions of these multiple snapshots, 493 00:52:29,804 --> 00:52:33,244 but you can collect the snapshots during the course of training. 494 00:52:34,133 --> 00:52:43,210 There's actually a very nice paper being presented at ICLR this week that kind of has a fancy version of this idea, where we use a crazy learning rate schedule, 495 00:52:43,210 --> 00:52:47,996 where our learning rate goes very slow, then very fast, then very slow, then very fast. 
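[Backing up for a moment to the basic ensemble idea: a minimal sketch of averaging the predictions of several independently trained models at test time might look like the following. predict_probs is a hypothetical method returning per-class probabilities; the snapshot variant being described here just swaps the independently trained models for checkpoints saved during a single training run.]

import numpy as np

def ensemble_predict(models, x):
    # average the class probabilities of several trained models, then pick the best class
    probs = np.mean([m.predict_probs(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)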
496 00:52:47,996 --> 00:52:57,631 The idea is that with this crazy learning rate schedule, then over the course of training, the model might be able to converge to different regions in the objective landscape that all are reasonably good. 497 00:52:58,717 --> 00:53:05,532 If you do an ensemble over these different snapshots, then you can improve your performance quite nicely, even though you're only training the model once. 498 00:53:05,532 --> 00:53:11,198 Questions? - [Student] [speaks too low to hear] 499 00:53:25,388 --> 00:53:33,413 - The question is, it's bad when there's a large gap between error 'cause that means you're overfitting, but if there's no gap, then is that also maybe bad? 500 00:53:33,413 --> 00:53:37,446 Do we actually want some small, optimal gap between the two? 501 00:53:37,446 --> 00:53:39,132 We don't really care about the gap. 502 00:53:39,132 --> 00:53:44,019 What we really care about is maximizing the performance on the validation set. 503 00:53:44,019 --> 00:53:54,995 What tends to happen is that if you don't see a gap, then you could have improved your absolute performance, in many cases, by overfitting a little bit more. 504 00:53:54,995 --> 00:54:02,720 There's this weird correlation between the absolute performance on the validation set and the size of that gap. We only care about absolute performance. 505 00:54:02,720 --> 00:54:03,735 Question in the back? 506 00:54:03,735 --> 00:54:07,004 - [Student] Are hyperparameters the same for the ensemble? 507 00:54:07,004 --> 00:54:09,528 - Are the hyperparameters the same for the ensembles? 508 00:54:09,528 --> 00:54:12,234 That's a good question. Sometimes they're not. 509 00:54:12,234 --> 00:54:19,614 You might want to try different sizes of the model, different learning rates, different regularization strategies and ensemble across these different things. 510 00:54:19,614 --> 00:54:22,614 That actually does happen sometimes. 511 00:54:23,496 --> 00:54:31,769 Another little trick you can do sometimes is that during training, you might actually keep an exponentially decaying average of your parameter vector itself to kind of have 512 00:54:31,769 --> 00:54:35,778 a smooth ensemble of your own network during training. 513 00:54:35,778 --> 00:54:41,649 Then use this smoothly decaying average of your parameter vector, rather than the actual checkpoints themselves. 514 00:54:41,649 --> 00:54:45,262 This is called Polyak averaging, and it sometimes helps a little bit. 515 00:54:45,262 --> 00:54:50,838 It's just another one of these small tricks you can sometimes add, but it's not maybe too common in practice. 516 00:54:50,838 --> 00:54:55,778 Another question you might have is that how can we actually improve the performance of single models? 517 00:54:57,229 --> 00:55:02,503 When we have ensembles, we still need to run, like, 10 models at test time. That's not so great. 518 00:55:02,503 --> 00:55:06,219 We really want some strategies to improve the performance of our single models. 519 00:55:06,219 --> 00:55:08,237 That's really this idea of regularization, 520 00:55:08,237 --> 00:55:11,954 where we add something to our model to prevent it from fitting the training data 521 00:55:11,954 --> 00:55:16,203 too well in the attempts to make it perform better on unseen data. 522 00:55:16,203 --> 00:55:23,515 We've seen a couple ideas, a couple methods for regularization already, where we add some explicit extra term to the loss. 
523 00:55:23,515 --> 00:55:29,738 Where we have this one term telling the model to fit the data, and another term that's a regularization term. 524 00:55:29,738 --> 00:55:33,032 You saw this in homework one, where we used L2 regularization. 525 00:55:34,804 --> 00:55:43,001 As we talked about in lecture a couple lectures ago, this L2 regularization doesn't really make maybe a lot of sense in the context of neural networks. 526 00:55:43,922 --> 00:55:47,982 Sometimes we use other things for neural networks. 527 00:55:47,982 --> 00:55:53,376 One regularization strategy that's super, super common for neural networks is this idea of dropout. 528 00:55:53,376 --> 00:55:55,080 Dropout is super simple. 529 00:55:55,080 --> 00:56:02,264 Every time we do a forward pass through the network, at every layer, we're going to randomly set some neurons to zero. 530 00:56:02,264 --> 00:56:08,688 Every time we do a forward pass, we'll set a different random subset of the neurons to zero. This kind of proceeds one layer at a time. 531 00:56:08,688 --> 00:56:15,193 We run through one layer, we compute the value of the layer, we randomly set some of them to zero, and then we continue up through the network. 532 00:56:15,193 --> 00:56:22,445 Now if you look at this fully connected network on the left versus a dropout version of the same network on the right, you can see 533 00:56:22,445 --> 00:56:30,400 that after we do dropout, it kind of looks like a smaller version of the same network, where we're only using some subset of the neurons. 534 00:56:30,400 --> 00:56:35,746 This subset that we use varies at each iteration, at each forward pass. 535 00:56:35,746 --> 00:56:36,732 Question? 536 00:56:36,732 --> 00:56:40,899 - [Student] [speaks too low to hear] 537 00:56:43,694 --> 00:56:46,375 - The question is what are we setting to zero? It's the activations. 538 00:56:46,375 --> 00:56:51,731 Each layer is computing previous activation times the weight matrix gives you our next activation. 539 00:56:51,731 --> 00:57:01,592 Then you just take that activation, set some of them to zero, and then your next layer will be partially zeroed activations times another matrix give you your next activations. 540 00:57:01,592 --> 00:57:03,155 Question? 541 00:57:03,155 --> 00:57:06,702 - [Student] [speaks too low to hear] 542 00:57:06,702 --> 00:57:08,751 - Question is which layers do you do this on? 543 00:57:08,751 --> 00:57:14,454 It's more common in fully connected layers, but you sometimes see this in convolutional layers, as well. 544 00:57:14,454 --> 00:57:23,423 When you're working in convolutional layers, sometimes instead of dropping each activation randomly, instead you sometimes might drop entire feature maps randomly. 545 00:57:24,455 --> 00:57:30,117 In convolutions, you have this channel dimension, and you might drop out entire channels, rather than random elements. 546 00:57:32,059 --> 00:57:38,480 Dropout is kind of super simple in practice. It only requires adding two lines, one line per dropout call. 547 00:57:38,480 --> 00:57:41,572 Here we have a three-layer neural network, and we've added dropout. 548 00:57:41,572 --> 00:57:49,460 You can see that all we needed to do was add this extra line where we randomly set some things to zero. This is super easy to implement. 549 00:57:49,460 --> 00:57:52,138 But the question is why is this even a good idea? 550 00:57:52,138 --> 00:57:58,067 We're seriously messing with the network at training time by setting a bunch of its values to zero. 
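[For concreteness, a minimal sketch of that two-extra-lines implementation for a three-layer network, where p is the probability of keeping a unit active. Only the forward pass at training time is shown; this is an illustration rather than the exact code on the slide.]

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, W2, W3):
    H1 = np.maximum(0, X.dot(W1))
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # zero out some activations
    H2 = np.maximum(0, H1.dot(W2))
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # zero out some activations
    return H2.dot(W3)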
551 00:57:58,067 --> 00:58:00,988 How can this possibly make sense? 552 00:58:00,988 --> 00:58:08,532 One sort of slightly hand-wavy idea that people have is that dropout helps prevent co-adaptation of features. 553 00:58:09,622 --> 00:58:15,853 Maybe if you imagine that we're trying to classify cats, maybe in some universe, the network might learn one neuron 554 00:58:15,853 --> 00:58:21,066 for having an ear, one neuron for having a tail, one neuron for the input being furry. 555 00:58:21,066 --> 00:58:24,751 Then it kind of combines these things together to decide whether or not it's a cat. 556 00:58:24,751 --> 00:58:32,831 But now if we have dropout, then in making the final decision about catness, the network cannot depend too much on any one of these features. 557 00:58:32,831 --> 00:58:37,725 Instead, it kind of needs to distribute its idea of catness across many different features. 558 00:58:37,725 --> 00:58:42,205 This might help prevent overfitting somehow. 559 00:58:42,205 --> 00:58:50,347 Another interpretation of dropout that's come out a little bit more recently is that it's kind of like doing model ensembling within a single model. 560 00:58:51,690 --> 00:58:58,745 If you look at the picture on the left, after you apply dropout to the network, we're kind of computing this subnetwork using some subset of the neurons. 561 00:58:58,745 --> 00:59:03,391 Now every different potential dropout mask leads to a different potential subnetwork. 562 00:59:03,391 --> 00:59:09,145 Now dropout is kind of learning a whole ensemble of networks all at the same time that all share parameters. 563 00:59:09,145 --> 00:59:13,790 By the way, because the number of potential dropout masks grows exponentially in the number 564 00:59:13,790 --> 00:59:17,152 of neurons, you're never going to sample all of these things. 565 00:59:18,089 --> 00:59:24,788 This is really a gigantic, gigantic ensemble of networks that are all being trained simultaneously. 566 00:59:25,622 --> 00:59:29,128 Then the question is what happens at test time? 567 00:59:29,128 --> 00:59:34,158 Once we move to dropout, we've kind of fundamentally changed the operation of our neural network. 568 00:59:34,158 --> 00:59:42,850 Previously, we've had our neural network, f, be a function of the weights, w, and the inputs, x, and then produce the output, y. 569 00:59:42,850 --> 00:59:48,268 But now, our network is also taking this additional input, z, which is some random dropout mask. 570 00:59:48,268 --> 00:59:52,732 That z is random. Having randomness at test time is maybe bad. 571 00:59:52,732 --> 00:59:57,444 Imagine that you're working at Facebook, and you want to classify the images that people are uploading. 572 00:59:57,444 --> 01:00:03,092 Then today, your image gets classified as a cat, and tomorrow it doesn't. That would be really weird and really bad. 573 01:00:03,092 --> 01:00:09,323 You'd probably want to eliminate this stochasticity at test time once the network is already trained. 574 01:00:09,323 --> 01:00:12,093 Then we kind of want to average out this randomness. 575 01:00:12,093 --> 01:00:18,131 If you write this out, you can imagine actually marginalizing out this randomness with some integral, but in practice, 576 01:00:18,131 --> 01:00:24,368 this integral is totally intractable. We don't know how to evaluate this thing. You're in bad shape.
577 01:00:24,368 --> 01:00:28,073 One thing you might imagine doing is approximating this integral via sampling, 578 01:00:28,073 --> 01:00:31,484 where you draw multiple samples of z and then average them out at test time, 579 01:00:31,484 --> 01:00:36,040 but this still would introduce some randomness, which is a little bit bad. 580 01:00:36,040 --> 01:00:41,423 Thankfully, in the case of dropout, we can actually approximate this integral in kind of a cheap way locally. 581 01:00:41,423 --> 01:00:47,228 If we consider a single neuron, the output is a, the inputs are x and y, with two weights, w one and w two. 582 01:00:47,228 --> 01:00:52,622 Then at test time, our value a is just w one x plus w two y. 583 01:00:53,590 --> 01:01:00,645 Now imagine that we train this network. During training, we used dropout with probability 1/2 of dropping our neurons. 584 01:01:00,645 --> 01:01:06,317 Now the expected value of a during training, we can kind of compute analytically for this small case. 585 01:01:07,712 --> 01:01:12,249 There are four possible dropout masks, and we're going to average out the values across these four masks. 586 01:01:12,249 --> 01:01:18,204 We can see that the expected value of a during training is 1/2 times (w one x plus w two y). 587 01:01:19,075 --> 01:01:29,000 There's this disconnect between the average value of w one x plus w two y at test time, and the average value at training time, which is only 1/2 as much. 588 01:01:29,000 --> 01:01:34,883 One cheap thing we can do is that at test time, we don't have any stochasticity. 589 01:01:34,883 --> 01:01:40,736 Instead, we just multiply this output by the dropout probability. Now these expected values are the same. 590 01:01:40,736 --> 01:01:44,733 This is kind of like a local cheap approximation to this complex integral. 591 01:01:44,733 --> 01:01:48,576 This is what people really commonly do in practice with dropout. 592 01:01:49,715 --> 01:01:56,269 With dropout, we have this predict function, and we just multiply the outputs of the layer by the dropout probability. 593 01:01:56,269 --> 01:01:59,393 The summary of dropout is that it's really simple on the forward pass. 594 01:01:59,393 --> 01:02:03,807 You're just adding two lines to your implementation to randomly zero out some nodes. 595 01:02:03,807 --> 01:02:10,209 Then at the test time prediction function, you just add one little multiplication by your dropout probability. 596 01:02:10,209 --> 01:02:16,613 Dropout is super simple. It tends to work well sometimes for regularizing neural networks. 597 01:02:16,613 --> 01:02:21,454 By the way, one common trick you see sometimes is this idea of inverted dropout. 598 01:02:22,665 --> 01:02:28,735 Maybe at test time, you care more about efficiency, so you want to eliminate that extra multiplication by p at test time. 599 01:02:28,735 --> 01:02:37,677 Then what you can do is, at test time, use the entire weight matrix unchanged, and at training time, instead divide by p, because training is probably happening on a GPU anyway. 600 01:02:37,677 --> 01:02:44,733 You don't really care if you do one extra multiply at training time, but then at test time, you kind of want this thing to be as efficient as possible. 601 01:02:44,733 --> 01:02:45,566 Question? 602 01:02:46,416 --> 01:02:56,777 - [Student] [speaks too low to hear] Now the gradient [speaks too low to hear]. 603 01:02:57,678 --> 01:03:02,212 - The question is what happens to the gradient during training with dropout? 604 01:03:02,212 --> 01:03:06,583 You're right.
We only end up propagating the gradients through the nodes that were not dropped. 605 01:03:06,583 --> 01:03:15,356 This has the consequence that when you're training with dropout, typically training takes longer because at each step, you're only updating some subparts of the network. 606 01:03:15,356 --> 01:03:22,287 When you're using dropout, it typically takes longer to train, but you might have a better generalization after it's converged. 607 01:03:24,409 --> 01:03:32,810 Dropout, we kind of saw is like this one concrete instantiation. There's a little bit more general strategy for regularization where during training 608 01:03:32,810 --> 01:03:37,482 we add some kind of randomness to the network to prevent it from fitting the training data too well. 609 01:03:37,482 --> 01:03:41,037 To kind of mess it up and prevent it from fitting the training data perfectly. 610 01:03:41,037 --> 01:03:46,160 Now at test time, we want to average out all that randomness to hopefully improve our generalization. 611 01:03:46,160 --> 01:03:53,927 Dropout is probably the most common example of this type of strategy, but actually batch normalization kind of fits this idea, as well. 612 01:03:53,927 --> 01:04:00,755 Remember in batch normalization, during training, one data point might appear in different mini batches with different other data points. 613 01:04:00,755 --> 01:04:07,200 There's a bit of stochasticity with respect to a single data point with how exactly that point gets normalized during training. 614 01:04:07,200 --> 01:04:14,735 But now at test time, we kind of average out this stochasticity by using some global estimates to normalize, rather than the per mini batch estimates. 615 01:04:14,735 --> 01:04:20,223 Actually batch normalization tends to have kind of a similar regularizing effect as dropout because they both introduce 616 01:04:20,223 --> 01:04:25,478 some kind of stochasticity or noise at training time, but then average it out at test time. 617 01:04:25,478 --> 01:04:35,744 Actually, when you train networks with batch normalization, sometimes you don't use dropout at all, and just the batch normalization adds enough of a regularizing effect to your network. 618 01:04:35,744 --> 01:04:43,833 Dropout is somewhat nice because you can actually tune the regularization strength by varying that parameter p, and there's no such control in batch normalization. 619 01:04:43,833 --> 01:04:48,928 Another kind of strategy that fits in this paradigm is this idea of data augmentation. 620 01:04:48,928 --> 01:04:57,078 During training, in a vanilla version for training, we have our data, we have our label. We use it to update our CNN at each time step. 621 01:04:57,078 --> 01:05:03,555 But instead, what we can do is randomly transform the image in some way during training such that the label is preserved. 622 01:05:03,555 --> 01:05:09,418 Now we train on these random transformations of the image rather than the original images. 623 01:05:09,418 --> 01:05:16,153 Sometimes you might see random horizontal flips 'cause if you take a cat and flip it horizontally, it's still a cat. 624 01:05:17,690 --> 01:05:23,763 You'll randomly sample crops of different sizes from the image because the random crop of the cat is still a cat. 625 01:05:25,188 --> 01:05:30,317 Then during testing, you kind of average out this stochasticity by evaluating with some 626 01:05:30,317 --> 01:05:34,309 fixed set of crops, often the four corners and the middle and their flips. 
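[For reference, a minimal sketch of the train-time flip-and-crop augmentation just described. The image is assumed to be an H x W x C array, and crop_size is a stand-in for whatever input size the network expects.]

import numpy as np

def augment(image, crop_size):
    # a flipped or cropped cat is still a cat, so the label is preserved
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]                     # random horizontal flip
    H, W, _ = image.shape
    y = np.random.randint(0, H - crop_size + 1)       # random crop location
    x = np.random.randint(0, W - crop_size + 1)
    return image[y:y + crop_size, x:x + crop_size, :]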
627 01:05:34,309 --> 01:05:38,041 What's very common is that when you read, for example, papers on ImageNet, they'll report 628 01:05:38,041 --> 01:05:47,308 a single crop performance of their model, which is just like the whole image, and a 10 crop performance of their model, which are these five standard crops plus their flips. 629 01:05:48,238 --> 01:05:56,345 Also with data augmentation, you'll sometimes use color jittering, where you might randomly vary the contrast or brightness of your image during training. 630 01:05:56,345 --> 01:06:04,642 You can get a little bit more complex with color jittering, as well, where you try to make color jitters that are maybe in the PCA directions of your data space or whatever, 631 01:06:04,642 --> 01:06:11,456 where you do some color jittering in some data-dependent way, but that's a little bit less common. 632 01:06:12,492 --> 01:06:18,037 In general, data augmentation is this really general thing that you can apply to just about any problem. 633 01:06:18,037 --> 01:06:24,940 Whatever problem you're trying to solve, you kind of think about what are the ways that I can transform my data without changing the label? 634 01:06:24,940 --> 01:06:31,218 Now during training, you just apply these random transformations to your input data. This sort of has a regularizing effect 635 01:06:31,218 --> 01:06:38,954 on the network because you're, again, adding some kind of stochasticity during training, and then marginalizing it out at test time. 636 01:06:40,055 --> 01:06:45,232 Now we've seen three examples of this pattern, dropout, batch normalization, data augmentation, 637 01:06:45,232 --> 01:06:47,154 but there's many other examples, as well. 638 01:06:47,154 --> 01:06:53,049 Once you have this pattern in your mind, you'll kind of recognize this thing as you read other papers sometimes. 639 01:06:53,049 --> 01:06:56,722 There's another kind of related idea to dropout called DropConnect. 640 01:06:56,722 --> 01:07:06,265 With DropConnect, it's the same idea, but rather than zeroing out the activations at every forward pass, we instead randomly zero out some of the values of the weight matrix. 641 01:07:06,265 --> 01:07:09,652 Again, it kind of has this similar flavor. 642 01:07:09,652 --> 01:07:16,281 Another kind of cool idea that I like, this one's not so commonly used, but I just think it's a really cool idea, 643 01:07:16,281 --> 01:07:19,400 is this idea of fractional max pooling. 644 01:07:19,400 --> 01:07:29,067 Normally when you do two-by-two max pooling, you have these fixed two-by-two regions over which you pool in the forward pass, but now with fractional max pooling, 645 01:07:29,067 --> 01:07:35,851 every time we have our pooling layer, we're going to randomize exactly the regions over which we pool. 646 01:07:35,851 --> 01:07:43,070 Here in the example on the right, I've shown three different sets of random pooling regions that you might see during training. 647 01:07:43,070 --> 01:07:48,857 Now during test time, you kind of average the stochasticity out 648 01:07:48,857 --> 01:07:54,704 by either sticking to some fixed set of pooling regions, or drawing many samples and averaging over them. 649 01:07:54,704 --> 01:07:59,027 That's kind of a cool idea, even though it's not so commonly used.
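[For reference, a minimal sketch of the DropConnect idea mentioned a moment ago: the random mask is applied to the weight matrix rather than to the activations. Here p is the probability of keeping a weight, and the function only shows one layer's forward pass.]

import numpy as np

def dropconnect_forward(x, W, p=0.5):
    # randomly zero entries of the weight matrix instead of the activations
    mask = np.random.rand(*W.shape) < p
    return x.dot(W * mask)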
650 01:07:59,027 --> 01:08:05,890 Another really kind of surprising paper in this paradigm that actually came out in the last year, so this is new since 651 01:08:05,890 --> 01:08:09,911 the last time we taught the class, is this idea of stochastic depth. 652 01:08:09,911 --> 01:08:15,490 Here we have a network on the left. The idea is that we have a very deep network. 653 01:08:15,490 --> 01:08:18,530 We're going to randomly drop layers from the network during training. 654 01:08:18,530 --> 01:08:24,113 During training, we're going to eliminate some layers and only use some subset of the layers during training. 655 01:08:24,114 --> 01:08:26,854 Now during test time, we'll use the whole network. 656 01:08:26,854 --> 01:08:30,251 This is kind of crazy. It's kind of amazing that this works, 657 01:08:30,251 --> 01:08:35,310 but this tends to have kind of a similar regularizing effect as dropout and these other strategies. 658 01:08:35,310 --> 01:08:42,041 But again, this is super, super cutting-edge research. This is not super commonly used in practice, but it is a cool idea. 659 01:08:44,694 --> 01:08:52,673 Any last minute questions about regularization? No? Use it. It's a good idea. Yeah? 660 01:08:52,673 --> 01:08:57,046 - [Student] [speaks too low to hear] 661 01:08:57,046 --> 01:09:01,184 - The question is do you usually use more than one regularization method? 662 01:09:04,325 --> 01:09:09,751 You should generally be using batch normalization as kind of a good thing to have in most networks nowadays because it 663 01:09:09,752 --> 01:09:12,650 helps you converge, especially for very deep things. 664 01:09:12,650 --> 01:09:25,204 In many cases, batch normalization alone tends to be enough, but then sometimes if batch normalization alone is not enough, then you can consider adding dropout or other thing once you see your network overfitting. 665 01:09:25,204 --> 01:09:28,526 You generally don't do a blind cross-validation over these things. 666 01:09:28,526 --> 01:09:33,942 Instead, you add them in in a targeted way once you see your network is overfitting. 667 01:09:36,400 --> 01:09:38,981 One quick thing, it's this idea of transfer learning. 668 01:09:38,981 --> 01:09:47,018 We've kind of seen with regularization, we can help reduce the gap between train and test error by adding these different regularization strategies. 669 01:09:48,903 --> 01:09:53,012 One problem with overfitting is sometimes you overfit 'cause you don't have enough data. 670 01:09:53,012 --> 01:10:00,444 You want to use a big, powerful model, but that big, powerful model just is going to overfit too much on your small dataset. 671 01:10:00,444 --> 01:10:05,909 Regularization is one way to combat that, but another way is through using transfer learning. 672 01:10:05,909 --> 01:10:12,730 Transfer learning kind of busts this myth that you don't need a huge amount of data in order to train a CNN. 673 01:10:12,730 --> 01:10:15,300 The idea is really simple. 674 01:10:15,300 --> 01:10:20,798 You'll maybe first take some CNN. Here is kind of a VGG style architecture. 675 01:10:20,798 --> 01:10:25,031 You'll take your CNN, you'll train it in a very large dataset, like ImageNet, 676 01:10:25,031 --> 01:10:28,039 where you actually have enough data to train the whole network. 677 01:10:28,039 --> 01:10:34,596 Now the idea is that you want to apply the features from this dataset to some small dataset that you care about. 
678 01:10:34,596 --> 01:10:42,864 Maybe instead of classifying the 1,000 ImageNet categories, now you want to classify 10 dog breeds or something like that. You only have a small dataset. 679 01:10:42,864 --> 01:10:45,917 Here, our small dataset only has C classes. 680 01:10:45,917 --> 01:10:58,135 Then what you'll typically do is for this last fully connected layer that is going from the last layer features to the final class scores, this now, you need to reinitialize that matrix randomly. 681 01:10:59,651 --> 01:11:02,952 For ImageNet, it was a 4,096-by-1,000 dimensional matrix. 682 01:11:02,952 --> 01:11:09,182 Now for your new classes, it might be 4,096-by-C or by 10 or whatever. 683 01:11:09,182 --> 01:11:13,985 You reinitialize this last matrix randomly, freeze the weights of all the previous layers 684 01:11:13,985 --> 01:11:21,947 and now just basically train a linear classifier, and only train the parameters of this last layer and let it converge on your data. 685 01:11:23,788 --> 01:11:28,756 This tends to work pretty well if you only have a very small dataset to work with. 686 01:11:28,756 --> 01:11:35,166 Now if you have a little bit more data, another thing you can try is actually fine tuning the whole network. 687 01:11:35,166 --> 01:11:44,935 After that top layer converges and after you learn that last layer for your data, then you can consider actually trying to update the whole network, as well. 688 01:11:44,935 --> 01:11:49,434 If you have more data, then you might consider updating larger parts of the network. 689 01:11:49,434 --> 01:11:56,143 A general strategy here is that when you're updating the network, you want to drop the learning rate from its initial learning rate 690 01:11:56,143 --> 01:12:02,973 because probably the original parameters in this network that converged on ImageNet probably worked pretty well generally, 691 01:12:02,973 --> 01:12:08,605 and you just want to change them a very small amount to tune performance for your dataset. 692 01:12:08,605 --> 01:12:15,490 Then when you're working with transfer learning, you kind of imagine this two-by-two grid of scenarios where on the one side, you have 693 01:12:15,490 --> 01:12:20,113 maybe very small amounts of data for your dataset, or very large amount of data for your dataset. 694 01:12:21,188 --> 01:12:28,780 Then maybe your data is very similar to images. Like, ImageNet has a lot of pictures of animals and plants and stuff like that. 695 01:12:28,780 --> 01:12:35,335 If you want to just classify other types of animals and plants and other types of images like that, then you're in pretty good shape. 696 01:12:35,335 --> 01:12:48,861 Then generally what you do is if your data is very similar to something like ImageNet, if you have a very small amount of data, you can just basically train a linear classifier on top of features, extracted using an ImageNet model. 697 01:12:48,861 --> 01:12:54,786 If you have a little bit more data to work with, then you might imagine fine tuning your data. 698 01:12:54,786 --> 01:12:58,755 However, you sometimes get in trouble if your data looks very different from ImageNet. 699 01:12:58,755 --> 01:13:06,781 Maybe if you're working with maybe medical images that are X-rays or CAT scans or something that looks very different from images in ImageNet, in that case, 700 01:13:06,781 --> 01:13:09,072 you maybe need to get a little bit more creative. 701 01:13:09,072 --> 01:13:14,408 Sometimes it still works well here, but those last layer features might not be so informative. 
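[To make the basic recipe concrete, here is a hedged PyTorch-style sketch of the version just described: freeze the pretrained layers, reinitialize the last fully connected layer, and train only that layer first. It assumes torchvision's VGG16 layout, where the final 4,096-by-1,000 layer is classifier[6]; the number of classes C and the learning rate are placeholders, not values from the lecture.]

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

model = models.vgg16(pretrained=True)        # CNN pretrained on ImageNet

for param in model.parameters():             # freeze all the pretrained weights
    param.requires_grad = False

C = 10                                       # placeholder number of classes in the small dataset
model.classifier[6] = nn.Linear(4096, C)     # reinitialize only the last layer (new params train)

# train just the new last layer; later, if you have more data, unfreeze more layers
# and fine tune the whole network with a lower learning rate than the original training used
optimizer = optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)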
702 01:13:14,408 --> 01:13:21,507 You might consider reinitializing larger parts of the network and getting a little bit more creative and trying more experiments here. 703 01:13:21,507 --> 01:13:29,015 This is somewhat mitigated if you have a large amount of data in your very different dataset 'cause then you can actually fine tune larger parts of the network. 704 01:13:29,015 --> 01:13:32,587 Another point I'd like to make is that this idea of transfer learning is super pervasive. 705 01:13:32,587 --> 01:13:35,660 It's actually the norm, rather than the exception. 706 01:13:35,660 --> 01:13:40,562 As you read computer vision papers, you'll often see system diagrams like this for different tasks. 707 01:13:40,562 --> 01:13:44,706 On the left, we're working with object detection. On the right, we're working with image captioning. 708 01:13:44,706 --> 01:13:48,387 Both of these models have a CNN that's kind of processing the image. 709 01:13:48,387 --> 01:13:53,913 In almost all applications of computer vision these days, most people are not training these things from scratch. 710 01:13:53,913 --> 01:13:59,973 Almost always, that CNN will be pretrained on ImageNet, and then potentially fine tuned for the task at hand. 711 01:13:59,973 --> 01:14:07,089 Also, in the captioning case, sometimes you can actually pretrain some word vectors relating to the language, as well. 712 01:14:07,089 --> 01:14:14,143 You maybe pretrain the CNN on ImageNet, pretrain some word vectors on a large text corpus, and then fine tune the whole thing for your dataset. 713 01:14:14,143 --> 01:14:22,278 Although in the case of captioning, I think this pretraining with word vectors tends to be a little bit less common and a little bit less critical. 714 01:14:22,278 --> 01:14:33,225 The takeaway for your projects, and more generally as you work on different models, is that whenever you have some problem that you want to tackle, but you don't have a large dataset for it, 715 01:14:33,225 --> 01:14:41,673 then what you should generally do is download some pretrained model that's relatively close to the task you care about, and then either 716 01:14:41,673 --> 01:14:44,859 reinitialize parts of that model or fine tune that model for your data. 717 01:14:44,859 --> 01:14:50,676 That tends to work pretty well, even if you have only a modest amount of training data to work with. 718 01:14:50,676 --> 01:14:57,738 Because this is such a common strategy, all of the different deep learning software packages out there provide a model zoo where you can just download 719 01:14:57,738 --> 01:15:01,099 pretrained versions of various models. 720 01:15:01,099 --> 01:15:06,043 In summary today, we talked about optimization, which is about how to improve the training loss. 721 01:15:06,043 --> 01:15:10,884 We talked about regularization, which is improving your performance on the test data. 722 01:15:10,884 --> 01:15:12,838 Model ensembling kind of fit into there. 723 01:15:12,838 --> 01:15:17,440 We also talked about transfer learning, which is how you can actually do better with less data. 724 01:15:17,440 --> 01:15:21,940 These are all super useful strategies. You should use them in your projects and beyond. 725 01:15:21,940 --> 01:15:25,238 Next time, we'll talk more concretely about some of the different deep learning software packages out there.